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Abstract 

The properties of certain networks are determined by hidden variables that are not explicitly 
f — , measured. The conditional probability (propagator) that a vertex with a given value of the hidden 
variable is connected to k of other vertices determines all measurable properties. We study hidden 
variable models and find an averaging approximation that enables us to obtain a general analytical 
result for the propagator. Analytic results showing the validity of the approximation are obtained. 
We apply hidden variable models to protein-protein interaction networks (PINs) in which the hidden 
variable is the association free-energy, determined by distributions that depend on biochemistry 
and evolution. We compute degree distributions as well as clustering coefficients of several PINs of 
different species; good agreement with measured data is obtained. For the human interactome two 
different parameter sets give the same degree distributions, but the computed clustering coefficients 
differ by a factor of about two. This shows that degree distributions are not sufficient to determine 
the properties of PINs. 
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I. INTRODUCTION 



Physicists have recently shown that network analysis is a powerful tool to study the statis- 
tical properties of complex biological, technological and social systems of diverse kinds [H El [3]. 
Many networks exhibit a scale-free degree distribution in which the probability pk that a ver- 
tex is connected to k other vertices falls as a power p^ ~ fc -7 . This property is not sufficient to 
completely describe natural networks because such systems also exhibit degree correlations- 
the degrees of the vertices at the end points of any given edge are not independent [U EJ El E| • 
It is not surprising that natural systems depend on properties that do not appear explicitly 
in degree distributions. In particular, protein interaction networks depend on the availability 
of sufficient binding free energyjB] to cause interactions to occur (links between vertices to 
exist). 

Caldarelli et al. [9] and Soderberg [10] proposed models in which vertices are characterized 
by a fitness parameter assigned according to a chosen probability distribution. Then, pairs of 
vertices are independently joined by an undirected edge with a probability depending on the 
fitnesses of the end points. Ref. [TT] generalized these models as a class of models with hidden 
variables and presented a detailed formalism showing how to compute network properties 
using the conditional probability (propagator) that a vertex with a given value of a hidden 
variable is connected to other k vertices. This formalism, valid for any Markovian (binary) 
network, provides the generating function for the propagator, but not the propagator itself. 

The purpose of this paper is twofold. We first use a mean field approximation to derive a 
general analytic formula for the propagator, therefore finding a general approximate solution 
to to the inversion problem. This enables one to compute network properties without the use 
of a simulation procedure, thereby simplifying the computational procedure and potentially 
broadening the ability of scientists from all fields to use network theory. The validity of the 
method is assessed by comparing the results of using our approximation with published results. 
We then use this method to compute clustering coefficients of a specific hidden variable model 
for protein-protein interaction networks (PIN) from several organisms developed by us [T2] 
that previously had obtained degree distributions in agreement with measured data. We show 
that two models with the same degree distribution have very different clustering coefficients. 

We outline this in more detail. Sect. [TT] reviews the hidden variable formalism and our 
approximate solution to the inversion problem. We distinguish between sparse (which have 
been solved in Ref. [llj) and non-sparse networks which are solved here. The next section 



III studies the models of Refs. [H] and [13] . Our averaging procedure is found to work well 



for most situations. Our own model[T2] is presented in IV We present an analytic result for 
the average connection probability and extend the results of [32] to computing the clustering 
coefficients. The final section Mis reserved for a brief summary and discussion. 



II. HIDDEN VARIABLE NETWORKS 

We present the formalism for hidden variable models [TT]. The probability that a node 
has a hidden continuous variable g is given by p{g), normalized so that its integral over its 
domain is unity. This function is chosen to be an exponential in [9] [12] and a Gaussian 
in [13]. The connection probability for two nodes of g,g' is defined to be p(g,g')- This is 
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taken as a step function in [HI [13], and a Fermi function in [T2] . The two functions and 
p{g,g') can be chosen in a wide variety of ways to capture the properties of a given network. 
Reference [TT] presents the probability generating function, Gq(x), that determines pk in terms 
of the generating function for the propagator, G (z,g), as 



G (z)= j dgp(g)G (z,g), (1) 

where 

\nG Q {z,g) = N j dg'p(g') log(l - (1 - z)p{g,g')). (2) 

The propagator Go{k,g) giving the conditional probability that a vertex of hidden variable g 
is connected to k other vertices is given implicitly by 

oo 

G (z,g) = J2 z k G (k,g). (3) 

k=0 

Knowledge of Go{k,g) determines the conditional probability P(k'\k) that a node of degree k 
is connected to a node of degree k', [TT] (as well as pk), and those two functions completely 
define a Markovian network. Once Go(k,g) is the determined, all of the properties of the 
given network are determined. The most well-known example is the degree distribution p^\ 

Pk = / dgp x (g)G (k,g). (4) 



It would seem that determining Go(k, g) from Eq. (J2J) is a simple technical matter, but this 
is not the casepj]. The purpose of the present Section is to provide a simple, analytic and 
accurate method to determine Go(k,g). 

We obtain Go(k,g) from Eq. §2§ by using the tautology 

p(g, g') = P(g) + (p(g, g') - P(g)) (5) 

in Eq. ([2]), choosing p(g) so as to eliminate the effects of the second term, and then treating 
the remaining higher powers of (p(g,g') — p{g)) as an expansion parameter. Using Eq. ^ in 
Eq. yields 

\nG (z,g) = lnG (z,g) = log(l - (1 - z)p{g)) N - N(l - z) J dg' p(g') ^0^. 

« (i-zY r , , f p(g,g')-P(g)) \ n ^ 
~ N 5 ~nT- J d9p{9) [l-(1- z)p(g) ) ■ (6) 



n=2 



In analogy with the mean-field (Hartree) approximation of atomic and nuclear physics, we 
find that the second term of Eq. ^ vanishes if we choose p(g) to be the average of p(g, g') 
over p(g'): 

p(g) = J dg'p(g')p(g,g'). (7) 
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With Eq. ([7]) the effects of the term of first order in (p(g,g') — p(g)) vanish. We therefore 
obtain the result: 

\ad Q (z,g) = ]og{l-(l-zm)»-Nf: / dg'p(g') ( f^'^ J > («) 

with the putative term with n — 1 vanishing by virtue of Eq. ([7]). 

We treat the first term of Eq. ^ as the leading order (LO) term and regard the remainder 
as a correction. The validity of this approach can be checked by comparison with simulations, 
or (in certain cases) with analytic results. Numerical results for the PIN of current interest 
[12] indicate that the corrections to the LO terms induce errors in pk of no more than a few 
percent and that the approximation becomes more accurate for large values of k. Therefore 
we use the LO approximation. Using exponentiation and the binomial theorem in the first 
term of Eq. ^ leads to the result 

Gi LO \k,g) -p(g)) N - k p(g)\ (9) 

which is of the form of a random binomial distribution in which the connection probability 
depends on the hidden variable g. The Eq. ^ is our central new general result that can 
be used for any hidden variable network. This binomial distribution has both the normal 
Gaussian and Poisson (Np(g) <C 1) distributions as limiting cases. 



A. Sparse and Nonsparse Networks 

Ref. [11] explained the difference between sparse and nonsparse networks. Sparse networks 
have a well-defined thermodynamic limit for the average degree, while this quantity diverges 
as the network size iV approaches infinity. Ref. [UJ defines criteria for sparseness by pointing 
out the relevance of p of Eq. ^ in determining whether or not a network is sparse. Given 
this quantity the average degree is 

(k) = J dgp{g)p{g) = J ' dg J dg' p(g)p(g, g')p(g'). (10) 

If the p(g) is independent of N the only way to obtain a non-divergent value (k) is for the 
connection probability [llj to scale as N^ 1 : 

psparsc^^/) = C (9,9 ) ^ sparge netwQrk _ ( U ) 



Under the specific assumption that Eq. (11) holds, Ref. [UJ finds a very interesting result. In 
our notation, this amounts to using Eq. (11 ) in Eq. ^ and taking the limit that N approaches 



infinity. Then 

G s r 8e (z,g) = exp(^ - 1) J dg'p(g')C(g,g'). (12) 

This shows that the Poisson limit of Eq. ^ is obtained for the very special case of sparse 
networks in which the connection probability scales as iV -1 . None of the models of interest 
here [91 [121 H3] are sparse, so it is our present result (j9]) that is widely applicable. 
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B. General Networks 



Turning to the use of the use of the propagator, we obtain the degree distribution as 



Pk 



J dgp(g)Go(k ig ) « J dgp(g)Gi LO \k,g). (13) 



This expression can be thought of as averaging a binomial distribution over the hidden variable 
and is a natural generalization of classical graph theory. A similar expression for p^ has been 
obtained, in the Poisson limit, in Ref. [32]. In that work, p^ is presented as an integral of 
the Poisson distribution for p(g) multiplied by the "P representation" of a density matrix. 
Comparing Eq. ^ with the result (3) of [15] shows that our propagator is proportional to the 
P representation, essentially our p(g). Ref. [15] shows, how under certain assumptions, to use 
p(k) to determine the P representation. Our method allows underlying network properties, 
denoted by p(g) and p(g,g'), to predict various network properties. 

The clustering coefficient which measures transitivity [3] : if vertex A is connected to vertex 
B and vertex B to vertex C, there is an increased probability that vertices A and C are 
connected. In graph theory, the clustering coefficient c(k) is the ratio of the number of 
triangles to the number of pairs, computed for nodes of degree k. Ref. [H] shows that 

c(k) = - fdgp(g)G (k,g)c(g) (14) 
Pk J 

c( 9 ) = Jdg Jdg m p(g,g ) p{g) ■ (15) 
Our calculations replace G by Gq L °^ of Eq. fo). 



III. SIMPLE MODELS AND ANALYTIC RESULTS 

One way to verify the LO approximation is to show that it reproduces analytic results for 
previously published models. We consider the models of [9] and |13j in this section. In both 
of these models p(g,g') is taken as a step function (the temperature limit of our model): 

P(g,g') = e(g + g'-p). (16) 



The two models differ in their choice of p(g), but the use of Eq. (16) allows one to obtain 
compact general expressions for the generating functions G (z, g),G (k, g),Pk and c{k). We 
present these first and discuss specific details of the individual models in separate sub-sections. 
The use of Eq. ( 16 ) in Eq. ^ yields 



lnG (^) = iV 

so that 



Q(p-g) / dg'p(g')+e(g-fi) 



log(;z) = Np{g)log{z), (17) 



G (z } g)=z N ^\ (18) 
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It is interesting to observe that Eq. <\8h reduces to the above result. This is because powers 



of p(g,g') m = p(g,g') for Eq. (16), so that the integration appearing in Eq. (J8J) leads to an 
expression that is a function of N,z, p. Then the use of the binomial theorem allows the second 
term of Eq. p| to be expressed as a summable power series in p which ultimately leads to the 
result Eq. ( |18[ ). 

If we follow [UJ and treat A; as a continuous variable (which requires large values of k) we 
find 

(l9) 

(20) 

where (?7v(fc) is the solution of the equation 

k = Np(g). (21) 



Note that for k = N, gn{k) can take on any value greater than //. The result Eq. (19) is the 



same as eq.(34) of [IT] , but written in a more compact form. The use of Eq. (19) in Eq. (13) 



and Eq. (14) yields the results 



v, - P ^ k)) (22) 

Pk ~ N\p>(g N (k))\ [22) 

(k) _ c(g N (k)) 

C{ ) ~ N\p>(g N (k))\- (23) 

A. Model of Caldarelli et a/.[9] 

This model is defined by using p(g) = exp(-g), but we generalize to take the form 

p x (g) = Xex V (-Xg). (24) 

Ref. [TT] works out this model using their Green's function formalism. Our purpose here is 
to compare the results of our averaging approximation with their results. For this model the 
average interaction probability p(g) is given by 

POO 

p(g)= dg'Xexp[-\g'}Q(g + g'~i2) = Q(g-fi) + Q( f i-g)exp[~X(fi~g)}. (25) 
J o 



Then our approximation Eq. (j 13[) for the degree distribution p^ is given by 
Pk 



n \ r» 

k 



[ dg\exp[-Xg}exp[-kX(jj- g)}(l-exp[-X(/2- g)]) N k (26) 
t Jo 

Define the integration variable t = exp [— A(/i — g)] so that 

Pk>i =(») e-»( * N + 1 -W k -» - B„ {k - ,, N + 1 - *)) , ,28) 

(I — f ) N 

Pk=1 = Ne- X » K N 0) 2 F 1 (1, N; N + 1, 1 - t ) (29) 
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where 2 -Pi is the confluent hypergeometric function and B to is the incomplete Beta function 
(and with to = 1 the Beta function): 

BJa,b)= f" dtt^CL-tf- 1 ,B 1 (a,b) = B(a,b). (30) 
J 

Consider the case 

1 < k, A/x « 10, (31) 



(the latter is typical of our biological model) so that the second term of Eq. (28) can be 
neglected. Evaluating the remaining gamma functions gives 

^^W^ry (32) 

Ref. [UJ computes the degree distribution for this model in analytic manner, using the ap- 
proximation Eq. (19) in which k is treated as a continuous variable and therefore "is expected 
to perform poorly for small values of k" . The result of [TT] (pf P5 ) is 

k 2 

which corresponds to agreement (for k 7^ N) within the stated domain of accuracy of Ref. [TT] . 
The confluence of Eq. (32 ) and Eq. (33 ) provides a verification of the accuracy of the averaging 
approximation. 

The results for k = N seem to disagree, so we examine this more closely. Use Eq. (18) 
directly to obtain the generating function G (z) as G (z) = f dg p(g) z N ^ 9 '. One obtains a 
result z N for all values of g (g > fi) such that p(g) = 1. Using this generating function yields 
the result 

p k =N = J dgp(g)Q(g - /i). (34) 

The specific value of the integral depends on the choice of p(g), but the result is a finite 
number for any choice of p(g) that satisfies the normalization condition that its integral over 
its domain is unity. Thus we believe that the correct result of using the propagator (eq(34) 
of [H] in their eq(ll)) is 

V B k PS = e" A ^ (35) 



instead of Eq. (33), which is in agreement with our result. 

Our approximation works very well in reproducing the computed clustering coefficient of 
[TT] . In particular, we evaluate c(g) of Eq. (15) to find that 

-( r /2 eM-g)G { LO \k,g)+ f 
Vk \Jo J 11/2 



Kk) = - ( r /2 eM-9)G { LO \k,g)+ f exp(-g)Gi LO \k,g)(2g - p + . (36) 

Vk \J0 Ju/2 I 



Numerical evaluation of this approximate expression accurately reproduces the result of Fig. 3 
of Ref. [TT]. Thus our mean field approximation is accurate for both our model[12j and the 
model of Ref. [9], 
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TABLE I: Parameters obtained in Ref. p2] 





N 


\ 


H* 


H. pylori 


732 


0.88 


7.06 


P. falciparum 


1,310 


0.93 


7.77 


S. cerevisiae 


4,386 


1.18 


7.94 


C. elegans 


2,800 


1.29 


8.19 


D. melanogaster 


2,806 


1.53 


8.89 


Human [21 


1,494 


0.64 


10.6 


Human [22] 


1,705 


0.67 


10.2 



IV. PROTEIN PROTEIN INTERACTION NETWORK-MODEL OF SHI ET AL. [12] 



Our principal application is to the the PIN of Ref. [T2] . This model is based on the concept 
of free energy of association. For a given pair of proteins the association free energy (in units 
of RT) is assumed to deviate from an average value a number contributed by both proteins 
additively as g+g'. This is a unique approximation to first-order in g and g' . Thermodynamics 
and the assumption that the interaction probability is independent of concentration allows us 
to write 

p(g,g') = 1/(1 + e^'), (37) 

which reduces to a step function in the zero temperature limit, but otherwise provides a 
smooth function. Increasing the value of \x weakens the strength of interactions, and previous 
results [12] showed the existence of an evolutionary trend to weaker interactions in more 
complex organisms. The probability that a protein has a value of g is given by the probability 
distribution 

Px(g) = - e e~ Xg , -l<\g< +oo, (38) 

where the positive real value of A governs the fluctuations of g. We previously chose the species- 
dependent values of A and fi so as to reproduce measured degree distributions obtained using 
the yeast two-hybrid method (Y2H) that reports binary results for protein-protein binding 
under a controlled setting[16]. Those parameters are displayed in Table I. The impact of 
the parameters A and /i are explained in Ref. [12] and displayed in Fig. 3 of that reference. 
Increasing the value of A increases the causes a more rapid decrease of p^-the slope of pk 
increases in magnitude. Increasing the value of \i decreases the magnitude of pk without 
altering the slope much for values of k greater than about 10. The ability to vary both the 
slope and magnitude of pt gives this model flexibility that allows us to describe the available 
degree distributions for different species. 

We obtain an analytic form for the for p(g) Eq. ^ of this model. Given Eq. (38) and 



Eq. (37) we find an analytic result: 

p(g,\) = 2^(1, A; A + l;-exp (//-<?)), (39) 
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FIG. 1: (Color online) Average connection probability A = 1, fj, = 10. Solid (Red): result of Eq. (40); 
dashed (blue) (containing the step function) result of Eq. (25). The approach to unity is smooth for 
Eq. (|40b. 



where 2-^1 is the confluent hypergeometric function. The special case A = 1 yields a closed 
form expression: 

p^g) = e g -» \n(l + e^ 9 ). (40) 

A smooth average connection probability is obtained in contrast with the result of the sharp 
cutoff model Eq. (25). This shown in Fig. [TJ 
It is useful to define the variable 



f = exp {n-g)> 0, (41) 
and note that an integral representation 

2 F 1 (n, A; A + 1; -£) = A P dt t x ~\l + £t)" n , (42) 

Jo 

is convenient for numerical evaluations. 

Knowledge of the propagator Eq. ^ allows us to compute the clustering coefficients of 
diverse species. The resulting degree distributions of pk (shown for the sake of completeness) 
and the newly computed clustering coefficients c(k) for yeast S. cerevisiae [T7], worm C. ele- 
gans [T8] and fruit fly D. melanog aster [19] are shown in Fig. [21 The parameters A and fx are 
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FIG. 2: (Color online) Degree distributions pk and clustering coefficients C{k) of diverse species. 
Degree distributions pk- The solid (red) curves are derived from the LO theory. The black dots are 
the results of experimental data as referenced in the text. The small (blue) circles are the results of a 
numerical simulation using the procedure of [T2]. Clustering coefficients C{k): The solid (red) curves 
are derived from the LO theory. The small (blue) dots are the results of a numerical simulation using 
the procedure of [12 and the heavy (black) dots represent the measured data. 

those of [12] , so the calculations of the clustering coefficients represent an independent major 
new prediction of our model. Results of numerical simulations and our analytic procedure are 
presented. The excellent agreement between the two methods verifies the LO approximation. 
More importantly, the agreement between our calculations and the measured clustering co- 
efficients is generally very good, so our model survives a very significant test. This bolsters 
the notion that the properties of a PIN are determined by a distribution of free energy. The 
clustering coefficient for yeast drops rapidly for large values of k (where statistics are poor), 
a feature not contained in our model. 

It is worthwhile to compare our model with that of [13]. That work chooses a Gaussian 
form of p(g), based on hydrophobicity, a step function form of p(g,g'), and is applied only to 
yeast. We found [12] that pk of [13] is scale free only for a narrow range of parameters, and 
we could not reproduce the data for diverse species using that model. 

The human interactome is of special interest. Fig. [3]\ shows the human degree distributions 
computed with two sets of parameters, one from Ref. [T2] (Table I) and the other using values 
of A = 0.94, p = 8.27 shown in the caption. The degree distributions are essentially identical, 
so only one curve can be shown. Each is approximately of a power law form and each describes 
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the measured degree distribution very well[20j. Calculations of degree correlations allows one 
to distinguish the two parameter sets. Figure |3]B shows that the cluster coefficients differ by 
a factor of two. We find that c(k) decreases substantially as A increases. The increase in A 
reduces the allowed spread in the value of g and reduces the value of integrand of Eq. ( 14 ) . 
It is interesting to note that the two existing measurements of the human c(k) differ by a 
factor of about an order of magnitude with the measurements of Ref.|22j obtaining much 
smaller values than those of [21]. The results of [21] are closer to our computed c(k) results 
for A = 0.94, n = 8.3. In contrast with the results for other species, our c(k) lie significantly 
above the data. However, the two data sets disagree substantially (by a factor of as much as 
100 for certain values of k) and both show a clustering coefficient that is generally significantly 
smaller than that of the other species. Several possibilities may account for the discrepancies 
between these two measurements of c(k) in humans and also for the differences between our 
model predictions and the experimental results, i) The human studies sample a limited subset 
of links of the complete network and this could bias the results, ii) The human protein subsets 
used in the two studies differ, iii) The human interactome is truly less connected than that of 
other species. This demonstrates the importance of measuring degree correlations to determine 
the underlying properties of the network. The current model and these considerations suggest 
the need for better design of future PIN studies that will not only include other species, 
but also comparisons between the PINs of different organs of a given species. Furthermore, 
comparisons between normal and malignant tissues could also be very fruitful. 



V. SUMMARY AND DISCUSSION 

In summary, this work provides a method to obtain the properties of hidden variable 
network models. The use of the approximation Eq. ([7]), used to obtain the propagator Eq. 
provides an excellent numerical approximation to exact results for the models considered 
here. If necessary, the method can be systematically improved through the calculation of 
higher order corrections. Our principal example is the PIN of Ref. pj2]. Not only does the 
use of Eq. ^ provide an accurate numerical result, but the model correctly predicts the 
clustering coefficients of most species. For the human interactome, two different parameter 
sets yield nearly the same degree distribution but very different clustering coefficients, showing 
the importance of measuring degree correlations to determine the underlying nature of the 
network. 

This work was supported in part by National Institutes of Health Grants GM45134 and 
DK45978 (to K.B.). We thank the authors of Refs. [2D E2] for providing tables of their data. 
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